Preprocessing

Constructing the ExpressionSet Class

The ExpressionSet class (ex_sc) is a convenient data structure that contains 3 dataframes. These dataframes contain expression data, cell information, and gene information respectively.

exprs(ex_sc) is the expression data, where rows are genes and columns are cells
pData(ex_sc) is cell information, where rows are cells and columns are metadata
fData(ex_sc) is gene information, where rows are genes and columns are metadata

ex_sc <- construct_ex_sc(mDC_UMI_Table) # sc_dat == Input expression matrix
ex_sc # Note that phenoData and featureData are empty right now!
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11584 features, 3814 samples 
  element names: exprs 
protocolData: none
phenoData: none
featureData: none
experimentData: use 'experimentData(object)'
Annotation:  
ex_sc_skin
ExpressionSet (storageMode: lockedEnvironment)
assayData: 14757 features, 3698 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: CB22_H1_ATAACAGGATCAGGGA CB22_H2_AAGCTCCTCTCACATC ...
    VB65_NL_CTGGGTATAAACACGG (3698 total)
  varLabels: Patient Skin
  varMetadata: labelDescription
featureData
  featureNames: A1BG A1CF ... ZZZ3 (14757 total)
  fvarLabels: Gene Coding
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:  
exprs(ex_sc)[1:5,1:5]
              0hrA_TGACGGACAAGTAATC 0hrA_CACAACAGTAGCCTCG 0hrA_GTTTGTTTGCACCTCT
0610007P14Rik                     0                     0                     0
0610009B22Rik                     0                     0                     0
0610009O20Rik                     0                     0                     0
0610010B08Rik                     0                     0                     0
0610010F05Rik                     0                     0                     0
              0hrA_GCTTACCTTGACCCTC 0hrA_GGAGAAGCGCTTTGGC
0610007P14Rik                     0                     0
0610009B22Rik                     0                     0
0610009O20Rik                     0                     0
0610010B08Rik                     0                     0
0610010F05Rik                     0                     0
pData(ex_sc)[1:5,]
data frame with 0 columns and 5 rows
fData(ex_sc)[1:5,]
data frame with 0 columns and 5 rows
rm(mDC_UMI_Table)

Data Annotation

Often we have metadata about the experiment that can be valuable in the analysis! Writing that information now may be appropriate. Our experiment consists of a time course with LPS stimulation.

pData(ex_sc)$Timepoint <- NA # initialize a new pData column

pData(ex_sc)[grep("0hr", rownames(pData(ex_sc))),"Timepoint"] <- "0hr"
pData(ex_sc)[grep("1hr", rownames(pData(ex_sc))),"Timepoint"] <- "1hr"
pData(ex_sc)[grep("4hr", rownames(pData(ex_sc))),"Timepoint"] <- "4hr"

head(pData(ex_sc))
                      Timepoint
0hrA_TGACGGACAAGTAATC       0hr
0hrA_CACAACAGTAGCCTCG       0hr
0hrA_GTTTGTTTGCACCTCT       0hr
0hrA_GCTTACCTTGACCCTC       0hr
0hrA_GGAGAAGCGCTTTGGC       0hr
0hrA_AAATCAGAGATCTCGG       0hr

Filtering

The first step is to filter your data to remove low quality cells. Often creating a histogram of the values and assigning cutoffs is simple and effective. Typically we remove all cells lower than 500-1000 UMIs / cell.

ex_sc <- calc_libsize(ex_sc, suffix = "raw") # sums counts for each cell

ex_sc <- pre_filter(ex_sc, minUMI = 1000, maxUMI = 10000, threshold = 1, minCells = 10,  print_progress = TRUE) # filters cells and genes
[1] "Filtering Genes"
[1] "Filtering Low Cells"
ex_sc <- calc_libsize(ex_sc, suffix = "raw_filtered")
plot_density(ex_sc, title = "UMI Density",  val = "UMI_sum_raw_filtered", statistic = "mean")  

Basic scRNA-seq analysis

Dimension reduction

Dimensionality reduction is necessary in order to bring the cells from a high dimensional gene expression space (~10k dimensions) down to a more reasonable number. Typically this is done first with PCA ~5-15 dimensions, before a final embedding is done using tSNE or UMAP to bring it down to 2 dimensions.

gene_subset <- subset_genes(ex_sc, method = "PCA", threshold = 1, minCells = 20, nComp = 10, cutoff = 0.85) 

ex_sc <- dim_reduce(ex_sc, genelist = gene_subset, pre_reduce = "iPCA", nComp = 10, tSNE_perp = 30, iterations = 500, print_progress=TRUE) 
[1] "Starting iPCA"
[1] "Starting tSNE"
Read the 3812 x 10 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 0.58 seconds (sparsity = 0.033243)!
Learning embedding...
Iteration 50: error is 85.934926 (50 iterations in 0.86 seconds)
Iteration 100: error is 80.296177 (50 iterations in 0.81 seconds)
Iteration 150: error is 80.013512 (50 iterations in 0.77 seconds)
Iteration 200: error is 79.974289 (50 iterations in 0.79 seconds)
Iteration 250: error is 79.965020 (50 iterations in 0.77 seconds)
Iteration 300: error is 2.469057 (50 iterations in 0.60 seconds)
Iteration 350: error is 2.128633 (50 iterations in 0.57 seconds)
Iteration 400: error is 1.964351 (50 iterations in 0.56 seconds)
Iteration 450: error is 1.867714 (50 iterations in 0.57 seconds)
Iteration 500: error is 1.805818 (50 iterations in 0.57 seconds)
Fitting performed in 6.87 seconds.
colnames(pData(ex_sc))
 [1] "Timepoint"            "UMI_sum_raw"          "UMI_sum_raw_filtered"
 [4] "x"                    "y"                    "iPC_Comp1"           
 [7] "iPC_Comp2"            "iPC_Comp3"            "iPC_Comp4"           
[10] "iPC_Comp5"            "iPC_Comp6"            "iPC_Comp7"           
[13] "iPC_Comp8"            "iPC_Comp9"            "iPC_Comp10"          
plot_tsne_metadata(ex_sc, color_by = "Timepoint") 

plot_tsne_metadata(ex_sc, color_by = "iPC_Comp1", title = "PC1 cell loadings") 

plot_tsne_metadata(ex_sc, color_by = "iPC_Comp2", title = "PC2 cell loadings") 

plot_tsne_metadata(ex_sc, color_by = "iPC_Comp3", title = "PC3 cell loadings") 

Excersise 1 : Dimension reduction on Skin Data

Now let us try dimension for the skin data!! First we can inspect the skin data to get a sense of what we are working with. Once we get a sense for the data we can then use the same functions as above to perform gene selection and dimension reduction.

dim(ex_sc_skin)
Features  Samples 
   14757     3698 
colnames(pData(ex_sc_skin))
[1] "Patient" "Skin"   
colnames(fData(ex_sc_skin))
[1] "Gene"   "Coding"
plyr::count(pData(ex_sc_skin))
   Patient          Skin freq
1     CB17     1-Healthy   53
2     CB19     1-Healthy  115
3     CB20     1-Healthy  153
4     CB21     1-Healthy  107
5     CB22     1-Healthy  602
6     CB23     1-Healthy   13
7     CB27     1-Healthy  235
8    VB114 2-NonLesional   27
9    VB114     3-Disease   14
10   VB126 2-NonLesional   25
11   VB126     3-Disease  452
12   VB130 2-NonLesional   48
13   VB130     3-Disease  203
14   VB131 2-NonLesional   35
15   VB131     3-Disease    7
16   VB132 2-NonLesional   24
17   VB132     3-Disease   20
18    VB65 2-NonLesional  134
19    VB65     3-Disease  234
20    VB66 2-NonLesional   46
21    VB66     3-Disease   92
22    VB75 2-NonLesional   17
23    VB75     3-Disease   13
24    VB76 2-NonLesional  153
25    VB76     3-Disease   80
26    VB77 2-NonLesional  180
27    VB77     3-Disease  449
28    VB85 2-NonLesional   19
29    VB85     3-Disease  148
# Use the subset_genes function to find variable genes in the skin data. Be sure to provide the input argument and method argument. Use ?subset_genes() to view help pages for functions. Try different methods and compare the number of genes you get out for each method.

gene_subset <- subset_genes(ex_sc_skin, method = "PCA", threshold = 1, minCells = 30)

# Use the dim_reduce function to create a 2D representation of the skin data. Be sure to provide the input argument and a gene list.

ex_sc_skin <- dim_reduce(ex_sc_skin, genelist = gene_subset, pre_reduce = "iPCA", nComp = 10, iterations = 500)
[1] "Starting iPCA"
[1] "Starting tSNE"
Read the 3698 x 10 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 0.56 seconds (sparsity = 0.033051)!
Learning embedding...
Iteration 50: error is 85.439689 (50 iterations in 1.33 seconds)
Iteration 100: error is 76.362799 (50 iterations in 1.10 seconds)
Iteration 150: error is 75.479012 (50 iterations in 0.67 seconds)
Iteration 200: error is 75.264499 (50 iterations in 0.65 seconds)
Iteration 250: error is 75.183824 (50 iterations in 0.62 seconds)
Iteration 300: error is 2.357471 (50 iterations in 0.50 seconds)
Iteration 350: error is 2.036250 (50 iterations in 0.49 seconds)
Iteration 400: error is 1.883499 (50 iterations in 0.50 seconds)
Iteration 450: error is 1.798063 (50 iterations in 0.50 seconds)
Iteration 500: error is 1.743551 (50 iterations in 0.52 seconds)
Fitting performed in 6.87 seconds.
# Now you can plot metadata from pData() or genes of interest, onto the tSNE mapping.

plot_tsne_metadata(ex_sc_skin, color_by = "Patient", facet_by = "Skin")

Clustering

Now that we have dimension reduced data we can try clustering it! For dimensions, both Comp and 2d are supported. There will determine if the clustering is done on principal components, or on the 2D representation. There are also 2 clustering algorithms available, density and spectral. Typically we recommend spectral clustering on PCA components, or density clustering on the 2d representation. Try both!

ex_sc <- cluster_sc(ex_sc, dimension = "Comp", method = "spectral", num_clust = 4) 
ex_sc$cluster_spectral <- ex_sc$Cluster

ex_sc <- cluster_sc(ex_sc, dimension = "2d", method = "density", num_clust = 4) 
Distance cutoff calculated to 0.05219639 
ex_sc$cluster_density <- ex_sc$Cluster

plot_tsne_metadata(ex_sc, color_by = "cluster_spectral")

plot_tsne_metadata(ex_sc, color_by = "cluster_density")

Excercise 2 : Clustering on the Skin Data

Now let us try clustering for the skin data!! Try both density based and spectral clustering!

ex_sc_skin <- cluster_sc(ex_sc_skin, dimension = "Comp", method = "spectral", num_clust = 4) 

plot_tsne_metadata(ex_sc_skin, color_by = "Cluster")

Cell Type Identification

There are many possible ways to identify cell types based on their gene expression. The id_markers function will identify genes that are highly expressed in a high proportion of a given cluster, relative to the other clusters.

ex_sc <- id_markers(ex_sc, print_progress = TRUE) # This is a quick method to find good markers genes for cell identification. These gene scores get written to fData()
[1] "Finding markers based on fraction expressing"
[1] "Finding markers based on mean expressing"
[1] "Merging Lists"
ex_sc <- calc_agg_bulk(ex_sc, aggregate_by = "Cluster")

markers <- return_markers(ex_sc, num_markers = 15) 

plot_heatmap(ex_sc, genes = unique(unlist(markers)), type = "bulk")

[[1]]


[[2]]

Call:
hclust(d = d, method = "complete")

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 60 

Excercise 3 : Cell Type identification Skin Data

Now try to identify the cell types in the skin data!

# ex_sc_skin <- id_markers()

# markers <- return_markers() 

# ex_sc_skin <- calc_agg_bulk()

# plot_heatmap()

Normalization

Now that the data has preliminary clusters, we can normalize. SCRAN normalization will first normalize internally in clusters, before normalizing across clusters. Once the data is normalized we can run the same steps as above before visualization. The first step is to select the genes to be used for normalization. One method would be to first only use genes expressed in more than n cells, and then remove the most variable genes. This method can be computationally expensive, and is currently commented out. A simpler approach, counts per million, is also provided below.

table(pData(ex_sc)$Cluster)

Cluster1 Cluster2 Cluster3 Cluster4 
     942      909     1624      337 
# ex_sc_norm <- norm_sc(ex_sc, pool_sizes = c(20,25,30,35,40))

x <- exprs(ex_sc)
cSum <- apply(x,2,sum)    # recompute for remaining cells
x <- as.matrix(sweep(x,2,cSum,FUN='/'))*1e6    # normalize to UMIs per million
ex_sc_norm <- construct_ex_sc(x)
pData(ex_sc_norm) <- pData(ex_sc)

Excercise 4 : Process the normalized mouse data!

Now that we have normalized, it is time to reprocess the data as before, this time on the normalized counts! Use the ex_sc_norm normalized counts from above to run through the same processing steps as before.

# gene_subset <- subset_genes()

# ex_sc_norm <- dim_reduce()

# ex_sc_norm <- cluster_sc()

# plot_tsne_metadata()

# plot_tsne_gene()

Advanced scRNA-seq analysis

Supervised Analysis

From the above analysis, it is clear that some clusters are formed based on their cell type, while others are based on their experimental condition. In these cases it can be useful to incorporate prior information in order to obtain clusters and annotations that are grounded in biological significance. Below, we can assign “panels” similar to flow cytometry, that will enable cell groupings based on the expression of genes that you believe to signify biologically relevenant cell types.

panel1 <- c("S100a9", "Lcn2") # Neutrophil Markers
panel2 <- c("Ccr7", "Fscn1") # DC
panel3 <- c("Csf1r", "Mertk") # Mac

panels <- list(panel1, panel2, panel3)

plot_tsne_gene(ex_sc_norm, gene = unlist(panels), title = "", log_scale = T)

names(panels) <- c("Neutrophil", "Dendritic", "Macrophage")

ex_sc_norm <- flow_filter(ex_sc_norm, panels = panels, title = "Flow Pass Cells")

ex_sc_norm <- flow_svm(ex_sc_norm, pcnames = "Comp")
Loading required package: lattice
plot_tsne_metadata(ex_sc_norm, color_by = "cluster_spectral", title = "Spectral Cluster on PCA components")

plot_tsne_metadata(ex_sc_norm, color_by = "SVM_Classify", title = "Spectral Cluster on PCA components")

DE analysis

Now that cells are grouped by their cell type, we can run DE in order to determine which genes are change in association with our experimental conditions.

For simplicity we can subset to 0hr and 4hr, to find the genes that change between these conditions.

It should be noted that DE should always be run on raw counts, not on the normalized counts!

ex_sc_norm_0_4 <- subset_ex_sc(ex_sc_norm, variable = "Timepoint", select = c("0hr", "4hr"))
[1] "Subsetted Data is 2808 cells"

 0hr  4hr 
1729 1079 
findDEgenes(input = ex_sc,
            pd = pData(ex_sc_norm_0_4),
            DEgroup = "Timepoint",
            contrastID = "0hr",
            facet_by = "SVM_Classify",
            outdir = "~/Downloads/")
[1] "Performing DE for Dendritic"
Calculating norm factors...
Estimating dispersion...
Doing likelihood ratio fit...
[1] "Performing DE for Macrophage"
Calculating norm factors...
Estimating dispersion...
Doing likelihood ratio fit...
[1] "Performing DE for Neutrophil"
Calculating norm factors...
Estimating dispersion...
Doing likelihood ratio fit...
plot_volcano(de_path = "~/Downloads/", de_file = "Macrophage_0hr_DEresults.tsv", fdr_cut = 0.000001, logfc_cut = 2)

plot_violin(ex_sc_norm_0_4, color_by = "Timepoint", facet_by = "SVM_Classify", gene = "Cxcl9")

plot_violin(ex_sc_norm_0_4, color_by = "Timepoint", facet_by = "SVM_Classify", gene = "Rsad2")

Homework

Now try to run DE between the 0hr and 1hr timepoints on the mouse data. Then make a volcano plot of the genes that are significantly changed (FDR < 0.001, logfc_cut >= 2) within Dendritic cells.

# ex_sc_norm_0_1 <- subset_ex_sc()

# findDEgenes() 

# plot_volcano()